Abstract

Different types of two- and three-dimensional representations of a finite metric
space are studied that focus on the accurate representation of the linear order
among the distances rather than their actual values. Lower and upper bounds for
representability probabilities are produced by experiments including random
generation, a rubber-band algorithm for accuracy optimization, and automatic proof
generation. It is proved that both farthest neighbour representations and cluster
tree representations always exist in the plane. Moreover, a measure of order
accuracy is introduced, and a lower bound on the possible accuracy is proved using
a clustering method and a result on maximal cuts in graphs.

1 Introduction 

The question of how distance information might be visualized is of importance for
many sciences including physics, medicine, sociology, and others. Mathematicians
began early to study the possibility of embedding a finite metric space $\underline{X}$
into other, in some sense better spaces like the Euclidean plane or 3-space. Beginning
with Menger [Men28], who gave the precise criteria for $\underline{X}$ to be isometrically
embeddable (that is, under exact preservation of the distances) into some Euclidean
space, most of them have focused on mappings that map $\underline{X}$ into some standard
space in a "quantitative" manner. The goal in this field of research, known under the
name metric scaling, is to preserve the values of the distances as well as possible,
that is, to minimize a certain error, known as "stress" (cf. [She62]).

The aim of this paper is to study more "qualitative" kinds of visualization of
distance data. In contrast to metric scaling, we will not be interested in the actual
values of distances but rather in their comparison. Considering only the linear order
among the distances instead of their values, a measure of order accuracy of a
representation is introduced. Unlike stress, order accuracy has an easy interpretation
as a certain probability of correctness. After an experimental exploration of different
types of representations, a lower bound on the possible accuracy of plane
representations will be proved using a clustering method and a result on maximal cuts
in graphs. The experimental methods include random generation, optimization of accuracy
by a rubber-band algorithm, and automatic proof generation. All results are summarized
in Table 1.

2 Order accuracy 

Throughout this paper, $\underline{X} = (X, d)$ is a finite metric space, that is, $X$ is
finite, and $d : X^2 \to [0, \infty]$ fulfils $d(x, y) = d(y, x)$,
$d(x, y) + d(y, z) \geq d(x, z)$, and $d(x, y) = 0$ if and only if $x = y$. However, one
advantage of the following approach is that it also applies to any finite, symmetric
distance set in the sense of [Hei98] and [Hei02], which is a far more general type of
object than a metric space. For the sake of simplicity, we will also assume that $X$
equals the set $n = \{0, \dots, n - 1\}$ of non-negative integers, and that the pairwise
distances between the points of $\underline{X}$ are all different, that is,
$d(x, y) = d(x', y') > 0$ implies $\{x, y\} = \{x', y'\}$. In particular, each $x \in X$
has exactly one nearest neighbour $nn(x) \in X$ and one farthest neighbour $fn(x)$ which
fulfil $d(x, nn(x)) < d(x, y) < d(x, fn(x))$ for all $y \in X \setminus \{x, nn(x), fn(x)\}$.

We will be mostly interested in representing the points of $X$ by points of either
some Euclidean space $E^m$, that is, the real vector space $\mathbb{R}^m$ with Euclidean
distance, or the $L_1$-plane $M^2$, that is, the set $\mathbb{R}^2$ with the "Manhattan"
distance $d(x, y) = |x_1 - y_1| + |x_2 - y_2|$.

The order accuracy $\alpha(f)$ of a map $f$ from $\underline{X}$ into some metric space
$\underline{Y} = (Y, e)$ is defined as the probability that, of two randomly chosen pairs
$\{x, y\}$ and $\{z, w\}$ of elements of $X$, the one with the larger distance in the
"representation" $f$ also has the larger "original" distance. More formally,

$$\alpha(f) = \binom{\binom{n}{2}}{2}^{-1} \cdot \Bigl|\Bigl\{ \{\{x, y\}, \{z, w\}\} \subseteq \mathcal{P}(X) :
x \neq y,\ z \neq w,\ \{x, y\} \neq \{z, w\},\ \text{and}\
d(x, y) < d(z, w) \iff e(fx, fy) < e(fz, fw) \Bigr\}\Bigr|.$$

Note that $2\alpha(f) - 1$ is just Kendall's rank correlation coefficient $g$ between the
two linear orders on the $\binom{n}{2}$ pairs $\{x, y\}$ that result when these pairs are
compared with respect to either their original or their image distance. Using a variant
of the merge-sort algorithm, $g$ can be computed in linear-times-logarithmic time, hence
$\alpha(f)$ can be computed in $O(n^2 \log n)$ time.
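To make the definition concrete, here is a minimal Python sketch that computes the order accuracy by brute force over all pairs of pairs. It is not the merge-sort variant mentioned above and needs time proportional to $n^4$. The function names, the dictionary-based input format, and the convention that a tie between two image distances counts as a disagreement are assumptions made for this illustration only.

```python
from itertools import combinations
from math import dist


def order_accuracy(d, image):
    """Brute-force order accuracy of a map into Euclidean space.

    d     -- dict mapping frozenset({x, y}) to the original distance d(x, y)
    image -- dict mapping each point x to its image f(x) as a coordinate tuple
    """
    # Image distance e(fx, fy) for every two-element subset {x, y}.
    e = {}
    for pair in d:
        x, y = tuple(pair)
        e[pair] = dist(image[x], image[y])

    agree = total = 0
    for p, q in combinations(list(d), 2):
        total += 1
        # Both orders agree on {p, q} iff the two differences have the same
        # sign; a tie in the image is counted as a disagreement (assumption).
        if (d[p] - d[q]) * (e[p] - e[q]) > 0:
            agree += 1
    return agree / total


def kendall_from_accuracy(alpha):
    """Rank correlation 2 * alpha - 1 (exact when there are no ties)."""
    return 2 * alpha - 1
```

For three collinear points whose image swaps the two shortest distances, this sketch gives an accuracy of 2/3 and hence a rank correlation of 1/3.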



3 Order and weaker representations 

An order representation of $\underline{X}$ in $\underline{Y}$ is some map
$f : \underline{X} \to \underline{Y}$ with $\alpha(f) = 1$, that is, with
$d(x, y) < d(z, w) \iff e(fx, fy) < e(fz, fw)$. Likewise, an order representation of a
(strict) linear order $<$ on the set $B(X)$ of two-element subsets of $X$ is a map
$f : X \to \underline{Y}$ with $\{x, y\} < \{z, w\} \iff e(fx, fy) < e(fz, fw)$. It will be
convenient to identify the metric space $\underline{X}$ with its associated linear order
$<$, which is given by $\{x, y\} < \{z, w\} :\iff d(x, y) < d(z, w)$.






Table 1: Representable fraction of linear orders on $B(n)$ for different kinds of
representations and different spaces (open intervals show exact bounds, > and < denote
estimated bounds, the question mark denotes a conjecture)

For $\underline{Y} = E^{n-2}$, there is always an order representation; there is even a
map $f$ for which $e(fx, fy) = d(x, y) + C$ for some constant $C \geq 0$. This was proved
by Cailliez [Cai83]. A random generation of five-element subsets of $E^3$ confirmed this
result for $n = 5$, and a similar experiment showed that all four-element metric spaces
not only have an order representation in $E^2$ but also in $M^2$.

To get a feeling for how probable a plane order representation is for a five-element
metric space, I also repeatedly drew five-element samples from the uniform distribution
on the unit square and determined the resulting order among the ten pairwise distances.
In this way, of the 10! = 3 628 800 linear orders on $B(5)$, at least 53.8% [resp. 65.2%]
were found to have an order representation in $\mathbb{R}^2$ with the Euclidean [resp.
"Manhattan"] metric. Moreover, at least 66.7% [resp. 67.7%] had a local order
representation, that is, a map $f : \underline{X} \to \mathbb{R}^2$ such that
$\{x, y\} < \{x, z\} \iff e(fx, fy) < e(fx, fz)$ for all $x, y, z$, where again $e$ was the
Euclidean [resp. "Manhattan"] metric. Judging from these empirical numbers, order
representability seems to be a considerably stronger condition than local order
representability in the Euclidean case, but not in the "Manhattan" case.
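The sampling experiment can be sketched as follows in Python. This is only an illustration of the idea, not the program used for the figures above: the function names, the representation of an order as a tuple of index pairs, and the fixed random seed are assumptions of this example.

```python
import random
from itertools import combinations
from math import dist


def manhattan(p, q):
    """L1 ("Manhattan") distance between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))


def induced_order(points, metric=dist):
    """Pairs of point indices, sorted by increasing distance."""
    pairs = combinations(range(len(points)), 2)
    return tuple(sorted(pairs, key=lambda p: metric(points[p[0]], points[p[1]])))


def sample_orders(trials, n=5, metric=dist, seed=0):
    """Set of distinct linear orders on B(n) hit by random point samples."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        points = [(rng.random(), rng.random()) for _ in range(n)]
        seen.add(induced_order(points, metric))
    return seen
```

Dividing `len(sample_orders(trials))` by 10! gives a lower bound on the representable fraction for `n = 5`; storing all orders found is what limits this approach to small `n`.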

Considering only the information coded in the functions $nn$ and $fn$, it was also
found that at least 88.3% of the 10! orders had a plane extremal neighbours
representation, that is, a map $f : \underline{X} \to E^2$ such that $nn(fx) = f(nn(x))$
and $fn(fx) = f(fn(x))$ for all $x \in X$. Likewise, at least 93.3% allowed for a map
under which both the nearest and second-nearest neighbours were represented accurately,
and another 3% allowed for a map under which at least the information about which points
were the two nearest to $x$ was represented accurately for all $x$ (see Table 1).

In view of the quickly growing number $\binom{n}{2}!$ of orders on $B(n)$ and the limited
space for storing the list of orders already found, such a random generation did not
make much sense for $n > 5$. It is, however, possible to estimate some similar lower
bounds at least for $n \in \{6, 7\}$ from the following experiment.



4 Representation by accuracy optimization 

Starting with a randomly generated $f : X \to E^m$, an order representation of a linear
order $<$ on $B(X)$ can often be produced by a stepwise maximization of order accuracy.
The following optimization step proved useful: for each pair $\{x, y\}, \{z, w\}$ with
$\{x, y\} < \{z, w\}$ and $e(fx, fy) \geq e(fz, fw)$, move $x, y$ towards each other by some
fixed fraction of $e(fx, fy)$, and move $z, w$ away from each other by the same fixed
fraction of $e(fz, fw)$. I have tested this kind of rubber-band algorithm in several ways:

(i) When $<$ was taken to be the order that corresponded to 8 or 25 independently
uniformly distributed random points in the unit square, the algorithm found an order
representation of $<$ in $E^2$ in about 96% of all cases, no matter whether 8 or 25 points
were taken. For 25 points, the resulting representations were almost similar to the
original sets. More precisely, for each edge the quotient between its original length
and its length in the representation was determined, and on average the relative
difference between maximal and minimal quotient was less than 5% (compared to 12% for 15
points and over 60% for 8 points).

(ii) When $<$ was taken from a uniform distribution of all linear orders on $B(5)$,
the algorithm succeeded in only 45% of the cases. Since, as mentioned before, more
than 53% of the orders actually have an order representation, this indicates that the
algorithm is susceptible to being caught in a local optimum.

However, in both (i) and (ii), the success of the algorithm did not seem to depend
on the initial state: when a cluster representation (see below) instead of a random
initial state was used, only the average number of iterations needed shrank slightly.

(iii) As in (i), but for five points in a 100-dimensional cube. Here the success rate 
was about 79%. Such finite subspaces of high-dimensional spaces frequently occur in 
multivariate statistics, for example. 

(iv) Generating the orders as in (ii), an order representation in $E^3$ of six-point
metric spaces was found in about 65% of 1000 cases, but of seven-point spaces in only
10.5% of 7000 cases.
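The optimization step described at the beginning of this section can be sketched in Python as below. Several details are not fixed by the description and are assumptions of this sketch: both points of a pair are moved symmetrically about their midpoint, each correction is applied immediately rather than collected over a sweep, the fraction defaults to 0.05, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist


def _scale_pair(pos, a, b, factor):
    """Scale the segment between pos[a] and pos[b] about its midpoint."""
    mid = [(p + q) / 2 for p, q in zip(pos[a], pos[b])]
    pos[a] = [m + factor * (p - m) for p, m in zip(pos[a], mid)]
    pos[b] = [m + factor * (q - m) for q, m in zip(pos[b], mid)]


def rubber_band_sweep(order, pos, fraction=0.05):
    """One sweep over all pairs of pairs; returns the number of violations.

    order -- list of pairs (x, y), sorted increasingly in the target order
    pos   -- dict mapping each point to a mutable list of coordinates
    """
    violated = 0
    for i, j in combinations(range(len(order)), 2):
        (x, y), (z, w) = order[i], order[j]  # {x, y} < {z, w} in the order
        if dist(pos[x], pos[y]) >= dist(pos[z], pos[w]):
            violated += 1
            _scale_pair(pos, x, y, 1 - fraction)  # pull x, y together
            _scale_pair(pos, z, w, 1 + fraction)  # push z, w apart
    return violated


def rubber_band(order, pos, fraction=0.05, max_sweeps=1000):
    """Repeat sweeps until none finds a violation; True on success."""
    for _ in range(max_sweeps):
        if rubber_band_sweep(order, pos, fraction) == 0:
            return True
    return False
```

A return value of `False` only means that no order representation was found within `max_sweeps`, which matches the observation that the method can get caught in a local optimum.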

The rubber-band algorithm has also been implemented as a Java applet which can
be tested at http://www-ifm.math.uni-hannover.de/~heitzig/distance.



Despite the algorithm's lack of optimality, we can use these results to estimate lower
bounds for the fraction of representable orders. As the samples were large enough, one
can use the approximate confidence bound that arises from the approximation of the
actual binomial distribution by a normal distribution (see [Kre91]). For a sample of
size $N$, $s + 1/2$ successes, and confidence level $\beta$, it has the form

$$\frac{\cdots}{N + c^2} \quad \text{with} \quad c = \Phi^{-1}(\beta).$$

Taking $\beta = 0.995$, this leads to the following conjectured bounds:

Conjecture 1 In $E^3$, a six- [seven-] element metric space has an order representation
with probability at least 60% [9.5%].

For six points in $E^2$, the same method gives a conjectured lower bound of only 2% (see
Table 1).
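A confidence bound of this kind can be sketched in Python as follows. The sketch assumes that the bound is the Wilson score bound, a standard normal-approximation bound whose denominator is also $N + c^2$; whether this is exactly the bound of [Kre91] is an assumption, the function name is hypothetical, and any correction of the success count by 1/2 is left to the caller.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(successes, n, beta=0.995):
    """One-sided Wilson score lower bound for a binomial proportion.

    successes -- (possibly corrected) number of successes
    n         -- sample size N
    beta      -- confidence level, so that c = Phi^{-1}(beta)
    """
    c = NormalDist().inv_cdf(beta)
    centre = successes + c * c / 2
    spread = c * sqrt(successes * (n - successes) / n + c * c / 4)
    return (centre - spread) / (n + c * c)
```

Under these assumptions, 650 successes out of 1000 give a bound of about 0.61, and 735 out of 7000 give about 0.096, which is in line with the 60% and 9.5% stated in Conjecture 1.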



5 Disproving local order representability 

A local order representation can also be characterized as a map that preserves the
order among the three sides of any triangle. More precisely,
$f : \underline{X} \to \underline{Y}$ is a local order representation if and only if for
each three distinct points $x, y, z \in X$ with $d(x, y) < d(y, z) < d(z, x)$, also
$e(fx, fy) < e(fy, fz) < e(fz, fx)$. Using elementary geometry, one sees that, in the
Euclidean plane, the latter is equivalent to
$\angle fx\,fz\,fy < \angle fy\,fx\,fz < \angle fz\,fy\,fx$ (*).
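The triangle characterization translates directly into a check over all triples of points. The following Python sketch tests whether a given placement of points in a Euclidean space is a local order representation; the function name and the input format are assumptions of this example.

```python
from itertools import combinations
from math import dist


def is_local_order_representation(d, image):
    """Check that the side order of every triangle is preserved.

    d     -- dict mapping frozenset({x, y}) to the original distance d(x, y)
    image -- dict mapping each point x to its image f(x) as a coordinate tuple
    """
    for x, y, z in combinations(sorted(image), 3):
        # Sides of the triangle, sorted by their original length.
        sides = sorted([(x, y), (y, z), (x, z)], key=lambda s: d[frozenset(s)])
        lengths = [dist(image[a], image[b]) for a, b in sides]
        # The image lengths must be strictly increasing in the same order.
        if not lengths[0] < lengths[1] < lengths[2]:
            return False
    return True
```

Only comparisons between sides sharing a triangle are checked, so a placement can pass this test without being an order representation.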

Therefore, the existence of a plane local order representation for some order $<$ can
be disproved by showing that a certain set of inequalities between angles in the plane
has no solution. The advantage of using angles instead of distances is that the
additional equations and inequalities which every n-point subset of the plane must
fulfil are all linear in the angles:

(i) $\angle abc \in [0, \pi]$

(ii) $\angle abc + \angle bca + \angle cab = \pi$

(iii) $\angle azc \leq \angle azb + \angle bzc$

(iv) $\angle azb + \angle bzc + \angle cza = 2\pi$ if $z$ is in the convex hull of $a, b, c$,

(v) $\angle azc = \angle azb + \angle bzc$ if $b$ is "between" $a$ and $c$ as seen from $z$.

In search of a local order representation for $\underline{X}$, these linear relations
together with those of type (*) enable us, starting with the largest interval $[0, \pi]$,
to successively narrow down the interval of possible values of each angle. If some
angle's interval becomes empty, there can be no local order representation of this order
$<$. This method can also be used to disprove the existence of even weaker kinds of
representations such as extremal neighbours representations.
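The narrowing of intervals can be illustrated for relation (ii) alone. The following Python sketch is not the proof generator behind Figure 1; it only tightens bounds (in degrees) for the three angles of a single triangle until nothing changes or some interval becomes empty. The function name and the use of closed instead of strict bounds are assumptions of this example.

```python
def narrow_triangle(intervals, total=180.0):
    """Tighten [lo, hi] bounds of three angles that must sum to `total`.

    Returns the narrowed intervals, or None if some interval becomes
    empty, which means that the given bounds are contradictory.
    """
    intervals = [list(iv) for iv in intervals]
    changed = True
    while changed:
        changed = False
        for i in range(3):
            others = [intervals[j] for j in range(3) if j != i]
            # angle_i = total - (sum of the other two angles)
            lo = max(intervals[i][0], total - sum(o[1] for o in others))
            hi = min(intervals[i][1], total - sum(o[0] for o in others))
            if lo > hi:
                return None
            if [lo, hi] != intervals[i]:
                intervals[i] = [lo, hi]
                changed = True
    return intervals
```

For instance, bounds of at most 60 for the smallest angle and at most 100 for the largest angle force the remaining angle to be at least 20, a bound that is not a multiple of 30.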

Example 2 Figure 1 shows a computer generated proof that the order
$\{d, e\} < \{a, d\} < \dots < \{b, d\}$ (listed on top) cannot occur among the distances
between five points in the plane. Lines 1, 2, and 3 state that certain angles are smaller
than 60°, smaller than 90°, or larger than 60° because they are the smallest, second
smallest, or largest in their corresponding triangle, respectively. Line 4 states that
only $c$ can be in the convex interior of the five points, since each of the remaining
four is the farthest neighbour of some other. Lines 5-7 apply the "tripod" inequality
(iii), using bounds already known from lines 1 and 2, this dependence being logged at
the end of the lines. Line 8 notices a violation of (iv) so that $c$ cannot be in the
convex hull of $a, b, d$. Similarly, line 9 states that also $b$ cannot be between $a$ and
$d$ as seen from $c$. In line 11, (ii) is used to derive a lower bound for a second
smallest angle from an upper bound for a largest angle. This is the only kind of argument
the algorithm can use to derive bounds that are not just multiples of 30°. The rest of
the proof should be clear now.

Note that the premises in lines 1-4 already follow from the information coded in 
the maps nn and fn alone, hence the order under consideration does not even have an 
extremal neighbours representation. 

There is a similar example which shows that it may also be impossible in the plane
to accurately represent the set of two nearest neighbours of five points. Since for
disjoint five-element subsets of some metric space $\underline{X}$, the orders that
correspond to these subsets are independently distributed, we have:

Corollary 3 For an n-element metric space, the probability of a plane extremal
neighbours representation shrinks exponentially for $n \to \infty$.

To get explicit upper bounds for local representability, I tested several thousand
randomly generated orders with this algorithm. For five points, 795 out of 10 000
orders could be shown to have no plane local order representation in this way. Using
again estimated confidence bounds with $\beta = 0.995$, this results in an estimated upper
bound of 0.928 for the fraction of plane locally order representable orders on $B(5)$.
For $n$ = 6, 7, 8, and 9, the corresponding numbers were 4156 out of 10 000, 3627 out of
4500, 11 690 out of 12 000, and 9990 out of 10 000, respectively, resulting in the upper
bounds shown in Table 1.

Figure 1: A computer generated non-representability proof.



TEST OF EDGE ORDER de < ad < ac < ab < ce < be < be < cd < ae < bd 
USING ONLY EXTREMAL NEIGHBOURS INFORMATION 

legend: points are labeled a,b,c,d,e 

xy is a segment , xyz is a triangle, x:yz is the angle in xyz at vertex x 
x:ywz means that x:yz=x:yw+x:wz 

line  type  proposition  follows from

larger
:cbe since a:ce<a:bc+a:be

CASE ANALYSIS using points a, bcd:

16. (i) ASSUMING a:bcd...
17. sum a:bd =a:bc+a:cd > 60+60= 120 16.3.3.
18. sum a:bc =a:bd-a:cd < 150-60= 90 16.5.3.
19. tripod a:be >=a:bd-a:de > 120-60= 60 17.1.
20. not a:bec since a:bc<a:ce+a:be 18.12.19.
21. hence a in bce 14.13.20.
22. contradiction! 21.4.
23. (ii) ASSUMING a:cbd...
24. sum a:cd =a:bc+a:bd > 60+60= 120 23.3.3.
25. sum a:bc =a:cd-a:bd < 150-60= 90 23.6.3.
26. tripod a:ce >=a:cd-a:de > 120-60= 60 24.1.
27. not a:bec since a:bc<a:be+a:ce 25.11.26.
28. hence a in bce 13.14.27.
29. contradiction! 28.4.
30. (iii) ASSUMING a:bdc...
31. not d:acb since a:bdc 30.
32. not d:abc since a:bdc 30.
33. hence d:bac 31.4.32.
34. not c:bad since a:bdc 30.
35. hence c:adb 8.9.34.
36. new sum c:abd since ad diag in cabd 30.33.
37. new circ d in abc since a:bdc and c:adb 30.35.
38. contradiction! 37.4.
39. (iv) ASSUMING a in bcd...
40. contradiction! 39.4.

CONTRADICTION in all four cases!

Figure 2: A "universal" nearest neighbour graph of nine points in the plane




Conjecture 4 In $E^2$, a six-element metric space has a local order representation with
probability at most 60%.

This fast vanishing of the probability of plane local order representability on the one 
hand shows that the above algorithm is quite successful, and on the other hand moti- 
vates the study of even weaker kinds of plane representation. 



6 Nearest and farthest neighbour representations

The directed graph $G_{nn}(X)$ with vertex set $V(G) = X$ and edge set
$E(G) = \{(x, nn(x)) : x \in X\}$ is known as the nearest neighbour graph of
$\underline{X}$. Asymptotic properties of nearest neighbour graphs of subsets of the plane
have been studied in [EPY97]. The farthest neighbour graph of $\underline{X}$ is defined
similarly. By a down-tree I mean a finite connected digraph all of whose vertices have
out-degree one, except for a root vertex with out-degree zero.

Proposition 5 A finite digraph G is a nearest [farthest] neighbour graph of a metric 
space if and only if each of its components is a disjoint union of two down-trees whose 
roots are joined by a double edge. 

Since the proof is easy but quite technical, it is omitted here. 
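The structure described in Proposition 5 is easy to observe on concrete point sets. The following Python sketch builds the nearest neighbour graph of a list of points, collects the double edges, and follows the edges from any vertex down to the double edge of its component. The function names are hypothetical, and the sketch assumes pairwise different distances, as in the rest of the paper.

```python
from math import dist


def nearest_neighbour_map(points):
    """Map each index i to the index of the point nearest to points[i]."""
    indices = range(len(points))
    return {
        i: min((j for j in indices if j != i),
               key=lambda j: dist(points[i], points[j]))
        for i in indices
    }


def bi_roots(nn):
    """Pairs {x, y} joined by a double edge: nn[x] == y and nn[y] == x."""
    return {frozenset((x, y)) for x, y in nn.items() if nn[y] == x}


def bi_root_of(nn, x):
    """Follow nearest-neighbour edges from x until a double edge is reached."""
    while nn[nn[x]] != x:
        x = nn[x]
    return frozenset((x, nn[x]))
```

The walk in `bi_root_of` terminates because the edge lengths strictly decrease along a path when all distances are different.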



The digraphs characterized by this result will be called bi-rooted forests in the se- 
quel, and a pair of roots will be called a bi-root for short. A proper child of a vertex x 
in a digraph is a vertex y for which there is an edge (y, x) but no edge (x, y). 

Proposition 6 A bi-rooted forest of size at most nine occurs as a nearest neighbour
graph in the plane if and only if no vertex has more than four proper children.

Proof. Let $G$ be a bi-rooted forest with $|V(G)| \leq 9$. If some vertex $x$ has five
proper children $x_1, \dots, x_5$, there is no nearest neighbour representation in $E^2$.
For otherwise, for $i \neq j$, the longest side of the triangle $x_i x_j x$ would be
$x_i x_j$, hence the angle between the segments $x_i x$ and $x_j x$ would be larger than
$\pi/3$. Likewise, the longest side of the triangle $x_i\, x\, nn(x)$ would be
$x_i\, nn(x)$, hence the angle $\angle x_i\, x\, nn(x)$ would also be larger than $\pi/3$.
This is impossible in the plane, since the six angles around $x$, each larger than
$\pi/3$, would sum to more than $2\pi$.

On the other hand, one can verify that all bi-rooted forests with at most nine vertices
and without vertices that have more than four proper children fit into the "universal"
forest sketched in Figure 2. Each of its four components is constructed from its two
roots (joined by a double edge of length 100) by successively adding children, where
the edges originating from children of order $n$ have length $100 + n$ and share a mutual
angle of $(65 + i - n)^\circ$ if they are neighboured. Since in that figure, each edge
points towards the nearest neighbour, the proposition is proved. □

Using this result, it was possible to calculate the fractions of linear orders on $B(n)$
with a plane nearest neighbour representation shown in Table 1. Note that for $n = 10$,
the analogue of the above proposition is false, a counter-example being the bi-rooted
forest consisting of two connected roots with four children each.
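The criterion of Proposition 6 can be evaluated mechanically. The following Python sketch counts proper children in a nearest neighbour map given as a dictionary and applies the criterion; it assumes that the input already is a bi-rooted forest, and the function names are hypothetical.

```python
from collections import Counter


def proper_children(nn):
    """Number of proper children of each vertex.

    y is a proper child of x if there is an edge (y, x) but no edge (x, y),
    that is, nn[y] == x and nn[x] != y.
    """
    return Counter(x for y, x in nn.items() if nn[x] != y)


def meets_plane_criterion(nn):
    """Criterion of Proposition 6; only stated for at most nine vertices."""
    if len(nn) > 9:
        raise ValueError("the criterion is only stated for at most nine vertices")
    return max(proper_children(nn).values(), default=0) <= 4
```

The size check reflects the remark that the analogous statement already fails for ten vertices.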

As for nearest neighbour representations in $E^3$, it was proved by Fejes Tóth [FT43]
that of $n$ points on a unit sphere in $E^3$, at least two must have a distance of at most

$$\delta_n := \sqrt{4 - \operatorname{cosec}^2 \frac{\pi n}{6(n - 2)}}.$$

In particular, $\delta_{14} \approx 0.98$, hence there exist no fourteen points on the unit
sphere with pairwise distance larger than one. In other words, of fourteen rays in $E^3$
with a common source, at least two have an angle of at most 60°. Therefore, a bi-rooted
forest with a root that has thirteen children cannot have a representation in $E^3$. In
particular, not all linear orders on $B(15)$ have a nearest neighbour representation in
$E^3$. However, one may hope that at least all linear orders on $B(13)$ have a
representation since there exist twelve such points on the sphere.
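The numerical values of this bound are easy to check. The Python sketch below assumes that the bound has the form $\delta_n = \sqrt{4 - \operatorname{cosec}^2(\pi n / (6(n - 2)))}$, which reproduces the values for fourteen and thirteen points quoted in this section; the function name is hypothetical.

```python
from math import pi, sin, sqrt


def fejes_toth_bound(n):
    """Upper bound for the smallest distance among n >= 3 points on the
    unit sphere, in the form assumed above."""
    angle = pi * n / (6 * (n - 2))
    return sqrt(4 - 1 / sin(angle) ** 2)
```

The bound drops below one between thirteen and fourteen points, which is why the argument above excludes fourteen rays but leaves the case of thirteen open.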

Conjecture 7 Every metric space of up to thirteen elements has a nearest neighbour
representation in $E^3$.

Note that $\delta_{13} \approx 1.014 > 1$, and the empirically supported conjecture that
there are no thirteen such points is still unproved; this suggests that questions of
representability of larger sets may also be quite difficult.

Surprisingly, a small degree at all vertices of the nearest neighbour graph does not
assure plane nearest neighbour representability: Eppstein, Paterson, and Yao [EPY97]
could show that for a subset $X$ of $E^2$, $|X| = O(D(G_{nn}(X))^5)$, where $D(G)$ is the
depth of $G$, that is, the maximal length of a path from a vertex to the corresponding
root. Using their exact bounds, one can show that for example the complete binary
bi-rooted tree with $2^{66} - 2 \approx 10^{20}$ vertices does not have a nearest neighbour
representation in $E^2$. However, it seems likely that already far smaller binary trees
fail to have one.

Eppstein et al. also showed that the expected number of components of $G_{nn}(X)$ is
asymptotic to approximately $0.31\,|X|$ if the points of $X$ are independently uniformly
distributed in the unit square. More precisely, the probability for a vertex to belong to
a bi-root is $6\pi/(8\pi + 3\sqrt{3}) \approx 0.6215$ in that case. From this it is also
clear that the expected fraction of elements of $X$ that are not the nearest neighbour of
some other element is at most 0.2785. However, the smallest exact upper bound to this
fraction is far larger:

Proposition 8 In any finite subset of $E^2$, at most 7/9 of its elements are not a
nearest neighbour of some other element, and this bound is sharp.

Proof. It is quite easy to see that the bi-rooted forest consisting of a root with four
and another with three children has a nearest neighbour representation in $E^2$, hence
7/9 is possible.

On the other hand, let $C$ be a component of the nearest neighbour graph of a finite
subset of the plane. Then its roots $r$ and $q$ together have $k \leq 7$ children, and $C$
can be constructed from these $k + 2$ vertices by successively adding $k_i \leq 4$ children
to some end vertex, thereby increasing the number of end vertices by $k_i - 1$ in step
$i$. Thus, the final fraction of end vertices in $C$ is

$$\frac{k + \sum_i (k_i - 1)}{(k + 2) + \sum_i k_i} \leq \frac{7}{9}$$

since $7(k + 2 + \sum_i k_i) - 9(k + \sum_i (k_i - 1)) = 14 - 2k + 9s - 2\sum_i k_i
\geq 9s - 2 \cdot 4s \geq 0$, where $s$ is the number of steps needed. □

In view of these facts about nearest neighbour graphs, the following might be a bit 
surprising: 

Theorem 9 Every finite metric space has a farthest neighbour representation in $E^2$.

Proof. Let $G := G_{fn}(X)$ be the corresponding farthest neighbour graph, $D$ its depth,
and define an infinite bi-rooted forest $H$ as follows. The vertices of $H$ are labelled
$a_{jt}$ and $b_{jt}$, where $j$ is a non-negative integer and $t$ runs over all tuples of
at most $D$ non-negative integers, including the empty tuple $\emptyset$. The bi-roots are
the pairs $\{a_{j\emptyset}, b_{j\emptyset}\}$ with non-negative integer $j$, each vertex
$a_{j(\dots,k,m)}$ is a child of $a_{j(\dots,k)}$, and each vertex $b_{j(\dots,k,m)}$ is a
child of $b_{j(\dots,k)}$. In other words, $H$ has countably many isomorphic components
(numbered by $j$), and each vertex has countably many children, up to depth $D$. This
digraph $H$ contains an isomorphic copy of $G$, hence it suffices to give a representation
of $H$. To address points of the plane, it will be convenient to identify $\mathbb{R}^2$
with the set $\mathbb{C}$ of complex numbers in the usual way.

For each non-negative integer j, let Cjo and Cji be the circles of radius 2 with 
centres Cj := e 2 ° 1? ™ and Cji := eS 1+2 3 respectively. These curves can be 
parametrized using the following functions, where the coefficients Xj > will be 
determined later: 

f j0 (O--=c j0 + 2e(*- 3 - 1 +^ and f jX (£) := Cjl + 2eW~ 1 +W . 

In particular, $f_{j0}(0) = 3c_{j0}$, $f_{j1}(0) = 3c_{j1}$,
$F_{j0} := f_{j0}[I] \subseteq C_{j0}$, and $F_{j1} := f_{j1}[I] \subseteq C_{j1}$, where
$I = [-2^D, 2^D] \subseteq \mathbb{R}$. Now the coefficients $\lambda_j$ are chosen small
enough so that $2^D \lambda_j < \pi/2$ and so that the smallest distance between the sets
$F_{j0}$ and $F_{j1}$ is still larger than the largest distance between a point in
$F_{j0} \cup F_{j1}$ and a point in $F_{k0} \cup F_{k1}$ for any $k \neq j$. This ensures
that, for $q \in \{0, 1\}$ and all $\xi \in I$, the unique point in
$\bigcup_k (F_{k0} \cup F_{k1})$ that is farthest away from the point $f_{jq}(\xi)$ is the
point $f_{j,1-q}(\xi/2)$. More generally, given $q \in \{0, 1\}$ and
$\xi, \beta, \gamma \in I$, we have

$$|f_{jq}(\xi) - f_{j,1-q}(\beta)| > |f_{jq}(\xi) - f_{j,1-q}(\gamma)|
\iff |\beta - \xi/2| < |\gamma - \xi/2| \qquad (\star).$$



Using this equivalence, one sees that the following recursive definition results in a
farthest neighbour representation $f$ of $H$:

$$f(a_{jt}) := f_{j,q(t)}(\xi(t)) \quad \text{and} \quad f(b_{jt}) := f_{j,1-q(t)}(-\xi(t)),$$

where the bi-roots have $q(\emptyset) := 0$ and $\xi(\emptyset) := 0$, their children have
$q((m)) := 1$ and $\xi((m)) := 1 + 2^{-m}$, and all others have
$q((\dots, k, m)) := 1 - q((\dots, k))$ and

$$\xi((\dots, k, m)) := 2\xi((\dots, k)) - (1 - 2^{-m})\bigl(\xi((\dots, k)) - \xi((\dots, k + 1))\bigr)
= (1 + 2^{-m})\,\xi((\dots, k)) + (1 - 2^{-m})\,\xi((\dots, k + 1)).$$

Because of $(\star)$, we need only verify that (i)
$|0 - \xi((m))/2| < |\xi((k, l)) - \xi((m))/2|$, which is true because of
$\xi((m)) \leq 2 < \xi((k, l))$, and that (ii)

$$|2\xi((\dots, k)) - \xi((\dots, k, m))| < |2\xi((\dots, k \pm 1)) - \xi((\dots, k, m))|,$$

where the left hand side equals $(1 - 2^{-m})c$ with
$c = \xi((\dots, k)) - \xi((\dots, k + 1))$, and the right hand side is the absolute value
of $(1 - 2^{-m})c + 2(\xi((\dots, k \pm 1)) - \xi((\dots, k)))$, which is larger than $c$ in
the "-" case and smaller than $-c$ in the "+" case. □

7 Cluster representations, and lower bounds for accuracy

An important question in applications of finite metric spaces is that of clustering the
elements into homogeneous, mutually heterogeneous groups. Formally, a hierarchical
clustering of $\underline{X}$ produces what I will call a cluster tree here, which can be
formalized as a chain of partitions $\mathcal{P}_1, \dots, \mathcal{P}_n$ on $X$, where
$\mathcal{P}_1 = \{\{x\} : x \in X\}$ is the discrete and $\mathcal{P}_n = \{X\}$ the
indiscrete partition, and each $\mathcal{P}_{k+1}$ with $k < n$ arises from $\mathcal{P}_k$
by joining two clusters, that is, replacing some $A, B \in \mathcal{P}_k$ by their union
$A \cup B$. Most common clustering methods fulfil the following property (*): if $k < n$,
$A, B \in \mathcal{P}_k$, $A \neq B$, and for all $a \in A$, $b \in B$, and $x, y \in X$,
either $x, y \in A \cup B$, or $x, y \in C$ for some $C \in \mathcal{P}_k$, or
$d(a, b) < d(x, y)$, then $A \cup B \in \mathcal{P}_{k+1}$. In other words, when all
distances between members of $A$ and $B$ are smaller than all distances between points
of other clusters, then $A$ and $B$ are joined next. Now, a cluster tree for $X$ is said
to have a cluster representation $f : X \to \underline{Y}$ when all clustering methods
that fulfil (*) reproduce this cluster tree when they are applied to the metric space
$\underline{X}' := (X, d')$ with $d'(x, y) := e(fx, fy)$.

Proposition 10 Every cluster tree $\mathcal{P}_1, \dots, \mathcal{P}_n$ for a finite set
$X$ has a cluster representation in $\{0, \dots, \lfloor (1 + \sqrt{2})^n / 4 \rfloor\}$
with Euclidean distance.

Proof. Inductively, we construct maps $f_i : X \to \mathbb{Z}$ and integers $S_i$ such that
$f_n$ is a cluster representation, and each $f_i$ is already "correct" for all
$C \in \mathcal{P}_i$. For $C \in \mathcal{P}_i$, the convex hull of $f_i[C]$ will be the
interval $[0, w_i(C)]$. For $A, B \in \mathcal{P}_i$ and $A \cup B \in \mathcal{P}_{i+1}$,
$f_{i+1}[A \cup B]$ will be constructed by placing $f_i[A]$ and $f_i[B]$ beside each other
at a distance $S_i$ that is larger than the diameter of any $C \in \mathcal{P}_i$, that
is, with $S_i > w_i(C)$.

We start with $f_1(a) := 0$ for all $a \in X$, so that $w_1(A) = 0$ for all
$A \in \mathcal{P}_1$, and put $S_1 := 1$. For $i \geq 1$, let $A_i, B_i \in \mathcal{P}_i$
be those elements with $C_i := A_i \cup B_i \in \mathcal{P}_{i+1}$ and
$\min A_i < \min B_i$. Now put

$$f_{i+1}(a) := f_i(a) \quad \text{for all } a \in A_i,$$
$$f_{i+1}(b) := f_i(b) + S_i + w_i(A_i) \quad \text{for all } b \in B_i,$$
$$f_{i+1}(x) := f_i(x) \quad \text{for all } x \notin C_i,$$

and $S_{i+1} := w_{i+1}(C_i) + 1$, where, by construction,
$w_{i+1}(C_i) = S_i + w_i(A_i) + w_i(B_i)$. Then the convex hull of $f_{i+1}[C_i]$ is
$[0, w_{i+1}(C_i)]$ as proposed. For all $C \in \mathcal{P}_{i+1}$ different from $C_i$,
we have $C \in \mathcal{P}_i$ and thus $S_{i+1} > S_i > w_i(C) = w_{i+1}(C)$ as required.
In case that $i \geq 2$, one of $A_i, B_i$ is in $\mathcal{P}_{i-1}$, hence either
$w_i(A_i) = w_{i-1}(A_i)$ or $w_i(B_i) = w_{i-1}(B_i)$. Putting
$m_i := \max\{w_i(A) : A \in \mathcal{P}_i\}$, this gives
$m_{i+1} \leq 2m_i + m_{i-1} + 1$. It is easy to verify that the corresponding recursive
upper bound $b_i$ with $b_{i+1} = 2b_i + b_{i-1} + 1$ and initial conditions $b_1 = 0$ and
$b_2 = 1$ is $b_i = ((1 + \sqrt{2})^i + (1 - \sqrt{2})^i)/4 - 1/2
= \lfloor (1 + \sqrt{2})^i / 4 \rfloor$. In particular,
$w_n(X) = m_n \leq b_n = \lfloor (1 + \sqrt{2})^n / 4 \rfloor$.

Finally, $f_n$ is a cluster representation: let $i < n$, $a \in A_i$, $b \in B_i$,
$A' \neq B' \in \mathcal{P}_i$ with $\{A', B'\} \neq \{A_i, B_i\}$, and $a' \in A'$,
$b' \in B'$. Then the smallest index $j$ for which there is $C \in \mathcal{P}_j$ with
$a', b' \in C$ is at least $i + 1$, hence
$d_{f_n}(a, b) = d_{f_i}(a, b) < S_i \leq S_{j-1} \leq d_{f_j}(a', b') = d_{f_n}(a', b')$. □
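The inductive construction in this proof can be sketched in Python as follows. The sketch takes the cluster tree as the sequence of merges, keeps one integer position per point, and shifts the second cluster of each merge by $S_i + w_i(A_i)$. The function name and the input format (a list of pairs of sets) are assumptions of this example.

```python
def cluster_representation(points, merges):
    """Integer positions on the line realizing a given cluster tree.

    points -- iterable of comparable point labels
    merges -- list of pairs (A, B) of disjoint clusters, in merging order
    """
    f = {x: 0 for x in points}                       # f_1(a) := 0
    width = {frozenset([x]): 0 for x in f}           # w_1({a}) = 0
    s = 1                                            # S_1 := 1
    for a, b in merges:
        a, b = frozenset(a), frozenset(b)
        if min(b) < min(a):                          # ensure min A_i < min B_i
            a, b = b, a
        shift = s + width[a]                         # S_i + w_i(A_i)
        for x in b:
            f[x] += shift                            # points of A_i stay put
        width[a | b] = s + width[a] + width[b]       # w_{i+1}(C_i)
        s = width[a | b] + 1                         # S_{i+1}
    return f
```

For four points with merges `{0}+{1}`, `{2}+{3}`, and then the two pairs, the positions are 0, 1, 4, 6, so the largest position 6 stays below the bound $\lfloor (1 + \sqrt{2})^4 / 4 \rfloor = 8$.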

Finally, this construction can be used to show the following lower bound on order
accuracy for maps into the real line:

Theorem 11 For every n-element metric space $\underline{X}$ with $n = 2^p$ for some
integer $p$, there is a map $f : \underline{X} \to E^1$ with order accuracy at least
$3/7 - O(1/n)$.

Proof. We iteratively define a binary cluster tree. For $k < n$, $\mathcal{P}_k$ is
constructed from $\mathcal{P}_{k+1}$ as follows: choose some $C \in \mathcal{P}_{k+1}$ of
maximal size, and let $w_C(\{x, y\})$ be the number of pairs $\{z, w\} \subseteq C$ with
$0 < d(z, w) < d(x, y)$. In [PT86] it was proved that there is a partition of $C$ into two
sets $A$ and $B$ of equal size such that

$$\sum_{x \in A,\, y \in B} w_C(\{x, y\}) \geq \frac{1}{2} \sum_{\{x, y\} \subseteq C} w_C(\{x, y\})
= \frac{1}{2} \binom{\binom{|C|}{2}}{2}.$$

Let $\mathcal{P}_k := \mathcal{P}_{k+1} \setminus \{C\} \cup \{A, B\}$. Note that
$w_C(\{x, y\})$ is now the sum of $w_{A,B}(\{x, y\})$, the number of pairs
$\{z, w\} \subseteq C$ with $0 < d(z, w) < d(x, y)$, $z \in A$, and $w \in B$, and of the
number of pairs $\{z, w\} \subseteq C$ with $0 < d(z, w) < d(x, y)$ and either
$z, w \in A$ or $z, w \in B$.

Now we construct a representation as in the previous proposition, except that we
might sometimes use $f_i'(a) := w_i(A_i) - f_i(a)$ and $f_i'(b) := w_i(B_i) - f_i(b)$
instead of $f_i(a)$ and $f_i(b)$ for the definition of $f_{i+1}|_{C_i}$. More precisely,
when $f_i$ has already been defined and $A_i, B_i, C_i$ are as in the proposition, let
$\gamma$ be the number of quadruples $(x, y, z, w) \in A_i \times B_i \times A_i \times B_i$
with $0 < d(z, w) < d(x, y)$ and $f_i(w) - f_i(z) < f_i(y) - f_i(x)$, and let $\gamma'$ be
the number of quadruples $(x, y, z, w) \in A_i \times B_i \times A_i \times B_i$ with
$0 < d(z, w) < d(x, y)$ and $f_i(z) - f_i(w) < f_i(x) - f_i(y)$. These numbers tell how
many pairs of edges between $A_i$ and $B_i$ will be represented with the correct order of
lengths when either $f_i$ or $f_i'$ is used for the definition of $f_{i+1}|_{C_i}$. Now
put $f_{i+1}(x) := f_i(x)$ for all $x \notin C_i$, and either

$$f_{i+1}(a) := f_i(a) \quad \text{for all } a \in A_i, \text{ and}$$
$$f_{i+1}(b) := f_i(b) + S_i + w_i(A_i) \quad \text{for all } b \in B_i$$

if $\gamma \geq \gamma'$, or otherwise

$$f_{i+1}(a) := f_i'(a) \quad \text{for all } a \in A_i, \text{ and}$$
$$f_{i+1}(b) := f_i'(b) + S_i + w_i(A_i) \quad \text{for all } b \in B_i.$$

This assures that $|f_{i+1}(x) - f_{i+1}(y)| > |f_{i+1}(z) - f_{i+1}(w)|$ whenever
$x \in A_i$, $y \in B_i$, and either $z, w \in A_i$ or $z, w \in B_i$. Moreover, since the
sum of $\gamma$ and $\gamma'$ is $\binom{|A_i||B_i|}{2}$, their maximum is at least
$|A_i||B_i|(|A_i||B_i| - 1)/4$. Hence, this step $i$ of the construction contributes to
the overall accuracy $\alpha$ a summand $\alpha_i$ with

$$\binom{\binom{n}{2}}{2}\,\alpha_i \geq \sum_{x \in A_i,\, y \in B_i}
\bigl( w_{C_i}(\{x, y\}) - w_{A_i,B_i}(\{x, y\}) \bigr) + \frac{|A_i||B_i|(|A_i||B_i| - 1)}{4}$$
$$= \sum_{x \in A_i,\, y \in B_i} w_{C_i}(\{x, y\}) - \binom{|A_i||B_i|}{2}
+ \frac{|A_i||B_i|(|A_i||B_i| - 1)}{4}$$
$$\geq \frac{1}{2} \binom{\binom{|C_i|}{2}}{2} - \frac{|A_i||B_i|(|A_i||B_i| - 1)}{4}
= \frac{3\,|C_i|^4}{64} - O(|C_i|^3).$$

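Finally, all $C_i$ are of size $n/2^q$ for some $q$ with $0 \leq q < p$, and there are
exactly $2^q$ many of this size. Hence the overall accuracy is

$$\alpha = \sum_{i=1}^{n-1} \alpha_i \geq \sum_{q=0}^{p-1} 2^q \cdot \frac{3\,(n/2^q)^4}{64}
\cdot \binom{\binom{n}{2}}{2}^{-1} + O(1/n) = \frac{3}{7} - O(1/n).$$

□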



However, this lower bound is very likely not the best possible. The rank correlation
$g$ between two independently chosen linear orders on $m$ elements is nearly normally
distributed with expected value 0 and standard deviation $O(1/\sqrt{m})$ (cf. [KG90]).
Hence $(g + 1)/2$ has expected value 1/2, which motivates the following conjecture.

Conjecture 12 Every finite metric space can be mapped into $E^1$ with accuracy 1/2.